An Unsupervised Text Normalization Architecture for Turkish Language

نویسندگان

  • Savas Yildirim
  • Tugba Yildiz
چکیده

A variety of applications on the problem of short-text messages require text normalization process that transforms ill-formed words into standard ones. Recently, many successful approaches have been applied to text normalization especially for social media text. Since each natural language has its own difficulties and barriers, we need to design an architecture to normalize short text messages in Turkish language which has an morphologically rich agglutinative structure. The model proceeds from simple solutions towards more complicated and sophisticated ones to reduce time complexity. A variety of techniques from lexical similarity to n-gram language modeling have been evaluated by exploiting several resources such as high quality corpus, morphological parser and dictionaries. We demonstrate that unsupervised text normalization architecture adapting both lexical and semantic similarity for Turkish domain has shown efficient results that might contribute to other studies.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Cascaded Approach for Social Media Text Normalization of Turkish

Text normalization is an indispensable stage for natural language processing of social media data with available NLP tools. We divide the normalization problem into 7 categories, namely; letter case transformation, replacement rules & lexicon lookup, proper noun detection, deasciification, vowel restoration, accent normalization and spelling correction. We propose a cascaded approach where each...

متن کامل

An Unsupervised Model for Text Message Normalization

Cell phone text messaging users express themselves briefly and colloquially using a variety of creative forms. We analyze a sample of creative, non-standard text message word forms to determine frequent word formation processes in texting language. Drawing on these observations, we construct an unsupervised noisy-channel model for text message normalization. On a test set of 303 text message fo...

متن کامل

A Graph-based Approach for Contextual Text Normalization

The informal nature of social media text renders it very difficult to be automatically processed by natural language processing tools. Text normalization, which corresponds to restoring the non-standard words to their canonical forms, provides a solution to this challenge. We introduce an unsupervised text normalization approach that utilizes not only lexical, but also contextual and grammatica...

متن کامل

A Log-Linear Model for Unsupervised Text Normalization

We present a unified unsupervised statistical model for text normalization. The relationship between standard and non-standard tokens is characterized by a log-linear model, permitting arbitrary features. The weights of these features are trained in a maximumlikelihood framework, employing a novel sequential Monte Carlo training algorithm to overcome the large label space, which would be imprac...

متن کامل

Building a Turkish ASR system with minimal resources

We present an open-vocabulary Turkish news transcription system built with almost no language-specific resources. Our acoustic models are bootstrapped from those of a well trained source language (Italian), without using any Turkish transcribed data. For language modeling, we apply unsupervised word segmentation induced with a state-of-the-art technique (Creutz and Lagus, 2005) and we introduce...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Research in Computing Science

دوره 90  شماره 

صفحات  -

تاریخ انتشار 2015